Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information

Authors

Youssef Bassil

Mohammad Alwani

Abstract

In computing, spell checking is the process of detecting incorrectly spelled words in a text and, in some cases, suggesting corrections for them. A spell checker is a computer program that uses a dictionary of words to perform spell checking; the larger the dictionary, the higher the error detection rate. Because spell checkers rely on regular dictionaries, they suffer from a data sparseness problem: they cannot capture a large vocabulary that includes proper names, domain-specific terms, technical jargon, special acronyms, and terminology. As a result, they exhibit a low error detection rate and often fail to catch major errors in the text. This paper proposes a new context-sensitive spelling correction method for detecting and correcting non-word and real-word errors in digital text documents. The approach relies on statistics from the Google Web 1T 5-gram data set, which consists of a large volume of n-gram word sequences extracted from the World Wide Web. The proposed method comprises an error detector that detects misspellings, a candidate spelling generator based on a character 2-gram model that generates correction suggestions, and an error corrector that performs contextual error correction. Experiments conducted on a set of text documents from different domains that contain misspellings showed an outstanding spelling error correction rate and a drastic reduction of both non-word and real-word errors. In future work, the proposed algorithm is to be parallelized so as to lower the computational cost of the error detection and correction processes.
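The three components named in the abstract (error detector, character 2-gram candidate generator, contextual error corrector) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The tiny `UNIGRAMS` and `NGRAMS` tables are hypothetical stand-ins for the Google Web 1T counts, the Dice coefficient over character bigrams is one assumed way to rank candidates, and using up to four preceding words as context is likewise an assumption. The sketch handles only non-word errors.

```python
from collections import Counter

# Hypothetical toy counts standing in for Google Web 1T data (illustration only).
UNIGRAMS = {"the": 1000, "on": 900, "cat": 50, "map": 45, "sat": 40, "mat": 30, "met": 20}
NGRAMS = {
    ("cat", "sat", "on", "the", "mat"): 120,
    ("cat", "sat", "on", "the", "map"): 3,
}


def char_bigrams(word):
    """Return the multiset of character 2-grams of a word, padded with '#'."""
    padded = f"#{word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def dice(a, b):
    """Dice similarity between the character-bigram multisets of two words."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    overlap = sum((ba & bb).values())
    return 2 * overlap / (sum(ba.values()) + sum(bb.values()))


def detect_errors(tokens):
    """Error detector: flag positions whose word is absent from the unigram table."""
    return [i for i, t in enumerate(tokens) if t.lower() not in UNIGRAMS]


def generate_candidates(word, k=3):
    """Candidate generator: top-k vocabulary words by character-bigram similarity.

    Ties are broken by unigram count, then alphabetically, to stay deterministic.
    """
    ranked = sorted(UNIGRAMS, key=lambda w: (-dice(word, w), -UNIGRAMS[w], w))
    return ranked[:k]


def correct(tokens, n=5):
    """Error corrector: pick the candidate whose n-gram with the preceding
    context has the highest count (assumed context: up to n-1 previous words).
    If no candidate is seen in context, the most similar candidate wins."""
    fixed = list(tokens)
    for i in detect_errors(fixed):
        context = tuple(t.lower() for t in fixed[max(0, i - (n - 1)):i])
        candidates = generate_candidates(fixed[i].lower())
        fixed[i] = max(candidates, key=lambda c: NGRAMS.get(context + (c,), 0))
    return fixed


if __name__ == "__main__":
    print(" ".join(correct("the cat sat on the maat".split())))
```

With these toy tables, the misspelling `maat` is flagged because it is missing from `UNIGRAMS`, the generator proposes `mat`, `cat`, and `map`, and the context lookup selects `mat`. Detecting real-word errors would require comparing n-gram counts for valid words as well, which this sketch leaves out.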

Similar resources

OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set

Since the dawn of the computing era, information has been represented digitally so that it can be processed by electronic computers. Paper books and documents were abundant and widely published at that time; hence, there was a need to convert them into digital format. OCR, short for Optical Character Recognition, was conceived to translate paper-based books into digital e-books. Regret...

Real-Word Spelling Correction using Google Web 1T 3-grams

We present a method for detecting and correcting multiple real-word spelling errors using the Google Web 1T 3-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the detection recall (the fraction of errors correctly detected) and the correction recall (the fraction of errors correc...

A Comparative Study of Bing Web N-gram Language Models for Web Search and Natural Language Processing

This paper presents a comparative study of the recently released Microsoft Web N-gram Language Models (MWNLM) on three web search and natural language processing tasks: search query spelling correction, query reformulation, and statistical machine translation. MWNLM, as well as the corresponding web services, called Microsoft Web N-gram Services, are much more accessible and easier to use than ...

Web-Scale N-gram Models for Lexical Disambiguation

Web-scale data has been used in a diverse range of language research. Most of this research has used web counts for only short, fixed spans of context. We present a unified view of using web counts for lexical disambiguation. Unlike previous approaches, our supervised and unsupervised systems combine information from multiple and overlapping segments of context. On the tasks of preposition sele...

Introduction to CKIP Chinese Spelling Check System for SIGHAN Bakeoff 2013 Evaluation

To accomplish the tasks of identifying incorrect characters and correcting errors, we developed two error detection systems with different dictionaries. The first system, called CKIP-WS, adopted the CKIP word segmentation system, which is based on the CKIP dictionary, as its core detection procedure; the other system, called G1-WS, used Google 1T uni-gram data to extract pairs of potential error word ...

Journal:

Computer and Information Science

Volume 5

Publication date: 2012
